Boot-Strapping Language Identifiers for Short Colloquial Postings
Authors
Abstract
There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Although language identification has been studied before, prior research has focused on very ‘clean’, editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end, we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and we conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models. With this work we provide a guide and a publicly available tool [1] to the mining community for language identification on web and social data.
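The abstract does not name the algorithms that were evaluated, so the following is only a minimal sketch of one well-known family of language identifiers: ranked character n-gram profiles compared with an "out-of-place" distance. It is meant to illustrate what a "language (model) profile size" can refer to, not to reproduce the authors' tool. The function names, the `TOY_CORPUS` sentences (standing in for Wikipedia text), and the default parameters are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, profile_size=300):
    """Return the most frequent character n-grams (lengths 1..n_max), most frequent first."""
    # Normalise case and whitespace, and pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:profile_size]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, models):
    """Return the language whose profile is closest to the profile of `text`."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Hypothetical toy training text; a real system would train on far larger corpora.
TOY_CORPUS = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs after the fox through the green fields",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro corre tras el zorro por los campos verdes",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund dem fuchs durch die grünen felder nach",
}
MODELS = {lang: ngram_profile(text) for lang, text in TOY_CORPUS.items()}

if __name__ == "__main__":
    print(identify("the dog and the fox", MODELS))
```

In this sketch, `profile_size` caps how many n-grams each language model keeps, and short inputs yield short document profiles, which is one plausible reason accuracy depends on both profile size and document length.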
Similar resources
Bootstrapping Knowledge About Social Phenomena Using Simulation Models
There are considerable difficulties in the way of the development of useful and reliable simulation models of social phenomena, including that any simulation necessarily includes many assumptions that are not directly supported by evidence. Despite these difficulties, many still hope to develop quite general models of social phenomena. This paper argues that such hopes are ill-founded, in other...
A Study of Colloquial Language in Jalal Al-e-Ahmad’s Fictions
As the most prominent novelist in contemporary Persian prose, Jalal Al-e-Ahmad has had great influence on Persian writers, and many have followed suit. The use of colloquial language is the characteristic style of his fiction. What sets his work apart, however, is mainly his use of colloquialism in a subtle, precise and accurate way. Due to the extensive use of collo...
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Bootstrapping Without the Boot
What: We like minimally supervised learning (bootstrapping). Let’s convert it to unsupervised learning (“strapping”). How: If the supervision is so minimal, let’s just guess it! Lots of guesses lots of classifiers. Try to predict which one looks plausible (!?!). We can learn to make such predictions. Results (on WSD): Performance actually goes up! (Unsupervised WSD for translational senses, Eng...
Journal title:
Volume / Issue
Pages -
Publication date 2013